{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pvKWP97u6jYY"
   },
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8-yiVTRl6jYY"
   },
   "source": [
    "In labor economics an important question is what determines the wage of workers. This is a causal question, but we could begin to investigate from a predictive perspective.\n",
    "\n",
    "In the following wage example,  𝑌  is the (log) hourly wage of a worker and  𝑋  is a vector of worker's characteristics, e.g., education, experience, sex. Two main questions here are:\n",
    "\n",
    "* How to use job-relevant characteristics, such as education and experience, to best predict wages?\n",
    "\n",
    "* What is the difference in predicted wages between male and female workers with the same job-relevant characteristics?\n",
    "\n",
    "In this lab, we focus on the prediction question first.\n",
    "\n",
    "## Data\n",
    "\n",
    "The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below  3 .\n",
    "\n",
    "The variable of interest  𝑌  is the (log) hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size  $n=5150$ ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qb_bGht76jYb"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "import sklearn.linear_model as lm\n",
    "import statsmodels.formula.api as smf\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.model_selection import train_test_split\n",
    "import warnings\n",
    "# ignore potential convergence warnings; for some small penalty levels,\n",
    "# tried out, optimization might not converge\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "20qWNJ6I6jYb"
   },
   "source": [
    "## Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2vf4YgrS6jYc"
   },
   "outputs": [],
   "source": [
    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv\"\n",
    "df = pd.read_csv(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bi7bgBZd6jYd"
   },
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_5hnESw-6jYf"
   },
   "source": [
    "### Construct variables\n",
    "\n",
    "We are constructing the output variable  $Y$  and the matrix  $Z$  which includes the characteristics of workers that are given in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UP1QSwi16jYf"
   },
   "outputs": [],
   "source": [
    "Y = np.log(df['wage'])\n",
    "Z = df.drop(['wage', 'lwage'], axis=1)\n",
    "Z.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "W38QGEM56jYf"
   },
   "source": [
    "For the outcome variable (log) wage and a subset of the raw regressors, we calculate the empirical mean and other empirical measures to get familiar with the data.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_8ZXZCt16jYg"
   },
   "outputs": [],
   "source": [
    "Z.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "q6Gn9T7I6jYg"
   },
   "source": [
    "E.g., the share of female workers in our sample is ~44% ( 𝑠𝑒𝑥=1  if female)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "t3CSumTv6jYg"
   },
   "outputs": [],
   "source": [
    "# if you want to print this table to latex\n",
    "print(Z.describe().style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2vcHDz8x6jYh"
   },
   "source": [
    "## Prediction Question"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eqPNtBFR6jYh"
   },
   "source": [
    "Now, we will construct a prediction rule for hourly (log) wage  $Y$, which depends linearly on job-relevant characteristics  $X$:\n",
    "\n",
    "$$\n",
    "𝑌=\\beta′𝑋+𝜖.\n",
    "$$\n",
    "\n",
    "\n",
    "Our goals are\n",
    "\n",
    "* Predict wages using various characteristics of workers.\n",
    "\n",
    "* Assess the predictive performance of a given model using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.\n",
    "\n",
    "\n",
    "Toward answering the latter, we measure the prediction quality of the two models via data splitting:\n",
    "\n",
    "- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we might consider).\n",
    "- Use the training sample to estimate the parameters of the Basic Model and the Flexible Model.\n",
    "- Before using the testing sample, we evaluate in-sample fit.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "T5wzCJyFthtM"
   },
   "outputs": [],
   "source": [
    "train, test = train_test_split(df, test_size=0.20, random_state=123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d2-Aghcrt_IA"
   },
   "source": [
    "\n",
    "We employ two different specifications for prediction:\n",
    "\n",
    "\n",
    "1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators and regional indicators).\n",
    "\n",
    "\n",
    "2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus a dictionary of transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of a polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.\n",
    "\n",
    "Using the **Flexible Model** enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver higher prediction accuracy but are harder to interpret."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Xe3ecrBwvIf8"
   },
   "source": [
    "## Data-Splitting: In-sample performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "KfS0dQlqvKvq"
   },
   "source": [
    "Let us fit both models to our data by running ordinary least squares (ols):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ww9XEg9q6jYh"
   },
   "outputs": [],
   "source": [
    "# 1. Basic Model\n",
    "model_base = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)'\n",
    "base = smf.ols(model_base, data=train)\n",
    "results_base = base.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "38uUus5amUnU"
   },
   "outputs": [],
   "source": [
    "rsquared_base = results_base.rsquared\n",
    "rsquared_adj_base = results_base.rsquared_adj\n",
    "mse_base = np.mean(results_base.resid**2)\n",
    "mse_adj_base = results_base.mse_resid\n",
    "print(f'Rsquared={rsquared_base:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj_base:.4f}')\n",
    "print(f'MSE={mse_base:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj_base:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "kqtHcMdQmUnU"
   },
   "outputs": [],
   "source": [
    "# verify the formulas\n",
    "X, y = base.data.exog, base.data.endog\n",
    "n, p = X.shape\n",
    "mse = np.mean((y - results_base.predict(X, transform=False))**2)\n",
    "mse_adj = mse * n / (n - p)\n",
    "rsquared = 1 - mse / np.var(y)\n",
    "rsquared_adj = 1 - mse_adj / np.var(y)\n",
    "print(f'Rsquared={rsquared:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj:.4f}')\n",
    "print(f'MSE={mse:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ydQNNmzs6jYi"
   },
   "outputs": [],
   "source": [
    "# 2. Flexible Model\n",
    "model_flex = ('lwage ~ sex + shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we '\n",
    "              '+ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)')\n",
    "flex = smf.ols(model_flex, data=train)\n",
    "results_flex = flex.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "o8eSXAqUmUnV"
   },
   "outputs": [],
   "source": [
    "rsquared_flex = results_flex.rsquared\n",
    "rsquared_adj_flex = results_flex.rsquared_adj\n",
    "mse_flex = np.mean(results_flex.resid**2)\n",
    "mse_adj_flex = results_flex.mse_resid\n",
    "print(f'Rsquared={rsquared_flex:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj_flex:.4f}')\n",
    "print(f'MSE={mse_flex:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj_flex:.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2uxr3OZT6jYj"
   },
   "source": [
    "#### Re-estimating the flexible model using Lasso\n",
    "We re-estimate the flexible model using Lasso (the least absolute shrinkage and selection operator) rather than ols. Lasso is a penalized regression method that can be used to reduce the complexity of a regression model when the ratio $p/n$ is not small. We will introduce this approach formally later in the course, but for now, we try it out here as a black-box method.  \n",
    "\n",
    "\n",
    "We use the statsmodels package with the formula api for uniformity in feature construction and the sklearn Lasso with cross-validation to tune the regularization hyperparameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ynf9WMhT6jYj"
   },
   "outputs": [],
   "source": [
    "# Lasso with cross-validation\n",
    "X = flex.data.exog[:, 1:]  # exclude the intercept; we don't want the lasso to penalize the intercept\n",
    "y = flex.data.endog\n",
    "\n",
    "# train model using Lasso with cross validation and variable normalization\n",
    "lasso = Pipeline([('scale', StandardScaler()),  # standardize the variables\n",
    "                  ('lasso', lm.LassoCV())])\n",
    "lasso.fit(X, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5RQc_tT1mUnV"
   },
   "outputs": [],
   "source": [
    "# verify the formulas\n",
    "n, p = X.shape\n",
    "p += 1\n",
    "mse_lasso = np.mean((y - lasso.predict(X))**2)\n",
    "mse_adj_lasso = mse_lasso * n / (n - p)\n",
    "rsquared_lasso = 1 - mse_lasso / np.var(y)\n",
    "rsquared_adj_lasso = 1 - mse_adj_lasso / np.var(y)\n",
    "print(f'Rsquared={rsquared_lasso:.4f}')\n",
    "print(f'Rsquared_adjusted={rsquared_adj_lasso:.4f}')\n",
    "print(f'MSE={mse_lasso:.4f}')\n",
    "print(f'MSE_adjusted={mse_adj_lasso:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "slxv2Km-6jYm"
   },
   "outputs": [],
   "source": [
    "# store the results in a table\n",
    "res_df = pd.DataFrame()\n",
    "\n",
    "res_df['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']\n",
    "\n",
    "res_df['p'] = [results_base.params.shape[0],\n",
    "               results_flex.params.shape[0],\n",
    "               results_flex.params.shape[0]]\n",
    "\n",
    "res_df['R2'] = [rsquared_base, rsquared_flex, rsquared_lasso]\n",
    "res_df['MSE'] = [mse_base, mse_flex, mse_lasso]\n",
    "\n",
    "res_df['adj_R2'] = [rsquared_adj_base, rsquared_adj_flex, rsquared_adj_lasso]\n",
    "res_df['adj_MSE'] = [mse_adj_base, mse_adj_flex, mse_adj_lasso]\n",
    "\n",
    "# Show results\n",
    "res_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9mZfw8rQ6jYm"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(res_df.style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tXtA8vw9waQ1"
   },
   "source": [
    "Considering the measures above, the flexible model performs slightly better than the basic model.\n",
    "\n",
    "As $p/n$ is not large, the discrepancy between the adjusted and unadjusted measures is not large. However, if it were, we might still like to apply **data splitting** as a more general procedure to deal with potential overfitting if $p/n$. We illustrate the approach in the following."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bZIdxq3X6jYn"
   },
   "source": [
    "## Data Splitting: Out-of-sample performance\n",
    "\n",
    "Now that we have seen in-sample fit, we evaluate our models on the out-of-sample performance:\n",
    "- Use the testing sample for evaluation. Predict the $\\mathtt{wage}$  of every observation in the testing sample based on the estimated parameters in the training sample.\n",
    "- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xsJsIKetx3Rp"
   },
   "outputs": [],
   "source": [
    "# We will use smf.ols just to get the full data frame and sm.OLS to\n",
    "# test out of sample for convenience\n",
    "# This is because predict is a bit tricky with smf.ols out of sample.\n",
    "tmp = smf.ols(model_base, data=df)  # just to extract df, not actually using this model\n",
    "X_full = tmp.data.exog\n",
    "y_full = tmp.data.endog\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "q9U8LYmL6jYn"
   },
   "outputs": [],
   "source": [
    "# estimating the parameters in the training sample\n",
    "regbasic = sm.OLS(y_train, X_train).fit()\n",
    "\n",
    "# predict out of sample\n",
    "yhat_reg_base = regbasic.predict(X_test)\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test1 = sum((y_test - yhat_reg_base)**2) / y_test.shape[0]\n",
    "R2_test1 = 1. - MSE_test1 / np.var(y_test)\n",
    "\n",
    "print(\"Test MSE for the basic model: \" + str(MSE_test1))\n",
    "print(\"Test R2 for the basic model: \" + str(R2_test1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "azUeNf8G6jYn"
   },
   "source": [
    "In the basic model, the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  is quite closed to the  $𝑀𝑆𝐸_{𝑠𝑎𝑚𝑝𝑙𝑒}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "B9b9XAix6jYo"
   },
   "outputs": [],
   "source": [
    "# We will use smf.ols just to get the full data frame and sm.OLS to test out of sample for convenience\n",
    "# This is because predict is a bit tricky with smf.ols out of sample.\n",
    "tmp = smf.ols(model_flex, data=df)  # just to extract df, not actually using this model\n",
    "X_full = tmp.data.exog\n",
    "y_full = tmp.data.endog\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)\n",
    "\n",
    "# estimating the parameters in the training sample\n",
    "regflex = sm.OLS(y_train, X_train).fit()\n",
    "\n",
    "# predict out of sample\n",
    "yhat_reg_flex = regflex.predict(X_test)\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test2 = np.mean((y_test - yhat_reg_flex)**2)\n",
    "R2_test2 = 1. - MSE_test2 / np.var(y_test)\n",
    "\n",
    "print(\"Test MSE for the flexible model: \" + str(MSE_test2))\n",
    "print(\"Test R2 for the flexible model: \" + str(R2_test2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-EEZIiCF6jYo"
   },
   "source": [
    "In the flexible model, the discrepancy between the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  and the  $𝑀𝑆𝐸_{𝑠𝑎𝑚𝑝𝑙𝑒}$  is not large.\n",
    "\n",
    "It is worth to notice that the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  vary across different data splits. Hence, it is a good idea average the out-of-sample MSE over different data splits to get valid results.\n",
    "\n",
    "Nevertheless, we observe that, based on the out-of-sample  $𝑀𝑆𝐸$ , the basic model using ols regression performs is about as well (or slightly better) than the flexible model.\n",
    "\n",
    "Next, let us use lasso regression in the flexible model instead of ols regression. Note that the out-of-sample  𝑀𝑆𝐸  on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "05auay496jYp"
   },
   "outputs": [],
   "source": [
    "# train model using Lasso with cross validation and variable normalization\n",
    "lasso = Pipeline([('scale', StandardScaler()),  # standardize the variables\n",
    "                  ('lasso', lm.LassoCV())])\n",
    "lasso.fit(X_train[:, 1:], y_train)\n",
    "\n",
    "# predict out of sample\n",
    "yhat_test_lasso = lasso.predict(X_test[:, 1:])\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test3 = np.mean((y_test - yhat_test_lasso)**2)\n",
    "R2_test3 = 1. - MSE_test3 / np.var(y_test)\n",
    "\n",
    "print(\"Test MSE for the basic model: \" + str(MSE_test3))\n",
    "print(\"Test R2 for the basic model: \" + str(R2_test3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "uO6YxpPw6jYp"
   },
   "outputs": [],
   "source": [
    "# store the results in a table\n",
    "res_df2 = pd.DataFrame()\n",
    "\n",
    "res_df2['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']\n",
    "\n",
    "res_df2['$MSE_{test}$'] = [MSE_test1, MSE_test2, MSE_test3]\n",
    "res_df2['$R^2_{test}$'] = [R2_test1, R2_test2, R2_test3]\n",
    "\n",
    "# Show results\n",
    "res_df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UJUxIXcP6jYq"
   },
   "outputs": [],
   "source": [
    "# print to Latex\n",
    "print(res_df2.style.to_latex())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Cq5ZEW1k3b4N"
   },
   "source": [
    "## Extra flexible model and Overfitting\n",
    "Given the results above, it is not immediately clear why one would choose to use Lasso as results are fairly similar. To motivate, we consider an extra flexible model to show how OLS can overfit significantly to the in-sample train data and perform poorly on the out-of-sample testing data.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3JVlk3Lg3ujb"
   },
   "outputs": [],
   "source": [
    "# Extra Flexible Model\n",
    "model_extra = ('lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)**2')\n",
    "tmp = smf.ols(model_extra, data=df)  # just to extract df, not actually using this model\n",
    "\n",
    "# In-sample fit\n",
    "insamplefit = tmp.fit()\n",
    "rsquared_ex = insamplefit.rsquared\n",
    "rsquared_adj_ex = insamplefit.rsquared_adj\n",
    "mse_ex = np.mean(insamplefit.resid**2)\n",
    "mse_adj_ex = insamplefit.mse_resid\n",
    "print(f'(In-sample) Rsquared={rsquared_ex :.4f}')\n",
    "print(f'(In-sample) Rsquared_adjusted={rsquared_adj_ex :.4f}')\n",
    "print(f'(In-sample) MSE={mse_ex :.4f}')\n",
    "print(f'(In-sample) MSE_adjusted={mse_adj_ex:.4f}')\n",
    "\n",
    "# Train test Split\n",
    "X_full = tmp.data.exog\n",
    "y_full = tmp.data.endog\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)\n",
    "\n",
    "# estimating the parameters in the training sample\n",
    "regextra = sm.OLS(y_train, X_train).fit()\n",
    "\n",
    "# predict out of sample\n",
    "yhat_reg_extra = regextra.predict(X_test)\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test4 = np.mean((y_test - yhat_reg_extra)**2)\n",
    "R2_test4 = 1. - MSE_test4 / np.var(y_test)\n",
    "\n",
    "print(\"Test MSE for the flexible model: \" + str(MSE_test4))\n",
    "print(\"Test R2 for the flexible model: \" + str(R2_test4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UkZBMW3o40GV"
   },
   "source": [
    "As we can see, a simple OLS overfits when the dimensionality of covariates is high, as the out-of-sample performance suffers dramatically in comparison to the in-sample performance.\n",
    "\n",
    "Contrast this with Lasso:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qUxBlRLU9d-d"
   },
   "outputs": [],
   "source": [
    "np.sum(lasso.named_steps['lasso'].coef_ != 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dUsSUi0H3ct3"
   },
   "outputs": [],
   "source": [
    "# train model using Lasso with cross validation and variable normalization\n",
    "lasso = Pipeline([('scale', StandardScaler()),  # standardize the variables\n",
    "                  ('lasso', lm.LassoCV())])\n",
    "lasso.fit(X_train[:, 1:], y_train)\n",
    "\n",
    "# predict in sample\n",
    "yhat_train_lasso = lasso.predict(X_train[:, 1:])\n",
    "\n",
    "# Calculate R-squared\n",
    "R2_L = 1 - np.sum((yhat_train_lasso - y_train)**2) / np.sum((y_train - np.mean(y_train))**2)\n",
    "\n",
    "# Calculate adjusted R-squared\n",
    "pL = np.sum(lasso.named_steps['lasso'].coef_ != 0)\n",
    "ntrain = len(X_train)\n",
    "baseline = np.sum((y_train - np.mean(y_train))**2) / (ntrain - 1)\n",
    "R2_adjL = 1 - (np.sum((yhat_train_lasso - y_train)**2) / (ntrain - pL - 1)) / baseline\n",
    "\n",
    "# Calculate Mean Squared Error (MSE)\n",
    "lasso_res = y_train - yhat_train_lasso\n",
    "MSEL = np.mean(lasso_res**2)\n",
    "\n",
    "# Calculate adjusted MSE\n",
    "MSE_adjL = (ntrain / (ntrain - pL - 1)) * MSEL\n",
    "\n",
    "# Print the results\n",
    "print(\"R-squared for the lasso with the extra flexible model (in-sample):\", R2_L)\n",
    "print(\"Adjusted R-squared for the extra flexible model (in-sample):\", R2_adjL)\n",
    "print(\"MSE for the lasso with the extra flexible model (in-sample):\", MSEL)\n",
    "print(\"Adjusted MSE for the lasso with the extra flexible model (in-sample):\", MSE_adjL)\n",
    "\n",
    "# predict out of sample\n",
    "yhat_test_lasso = lasso.predict(X_test[:, 1:])\n",
    "\n",
    "# calculating out-of-sample MSE\n",
    "MSE_test5 = np.mean((y_test - yhat_test_lasso)**2)\n",
    "R2_test5 = 1. - MSE_test5 / np.var(y_test)\n",
    "\n",
    "print(\"Test MSE for the basic model: \" + str(MSE_test5))\n",
    "print(\"Test R2 for the basic model: \" + str(R2_test5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1DmQoTrH5OUV"
   },
   "source": [
    "As shown above, the overfitting effect is mitigated with the penalized regression model."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
